Abstract - The ability to detect objects efficiently and accurately
plays a crucial role in many computer vision tasks. Recently,
offline object detectors have shown tremendous success. However,
one major drawback of offline techniques is that a complete set
of training data has to be collected beforehand. In addition,
once learned, an offline detector cannot make use of newly
arriving data. To alleviate these drawbacks, online learning has
been adopted with the following objectives: (1) the technique
should be computationally and storage efficient; (2) the updated
classifier must maintain its high classification accuracy. In
this paper, we propose an effective and efficient framework for
learning an adaptive online greedy sparse linear discriminant
analysis (GSLDA) model. Unlike many existing online boosting
detectors, which usually apply exponential or logistic loss, our
online algorithm makes use of LDA's learning criterion, which
not only aims to maximize class separation but also incorporates
the asymmetrical property of training data distributions. We
provide a better alternative to online boosting algorithms in
the context of training a visual object detector. We demonstrate
the robustness and efficiency of our methods on handwritten
digit and face data sets. Our results confirm that object
detection tasks benefit significantly when trained in an online
manner.

Index Terms — Object detection, asymmetry, greedy sparse 
linear discriminant analysis, online linear discriminant analysis, 
feature selection, cascade classifier. 

I. Introduction 

REAL-TIME object detection plays an important role in many
real-world vision applications. It is used as a preceding step
in applications such as intelligent video surveillance,
content-based image retrieval, and face and activity
recognition. Object detection is challenging from a machine
vision perspective because of the large variations in visual
appearance, object pose, illumination, camera motion, etc.
The literature on object detection is abundant. A thorough
discussion of this topic can be found in several surveys: faces
[1], [2], humans and pedestrians [3], [4], eyes [5], vehicles [6],
etc. In this paper, we review only the most relevant visual
detection work, focusing on algorithms that operate directly
on classification-based visual object detection and incremental
learning.

Object detection problems are often formulated as clas- 
sification tasks where a sliding window technique is used 
to scan the entire image and locate objects of interest Q, 
El, El. Viola and Jones |8| proposed an efficient detection 
algorithm based on the AdaBoost algorithm and a cascade 
classifier. Their detector is the first highly accurate real-time
face detector. They trained the classifier on data sets with a
few thousand faces and a large number of negative non-faces.
During the training procedure, negative samples are gradually
bootstrapped and added to the training set of the boosting
classifiers in the next stage. This method yields an extremely
low false positive rate. A large number of faces and non-faces
are used to cover different face appearances and poses, and the
huge space of possible non-faces. As a result, the computation
cost and memory requirements for training an AdaBoost detector
are unacceptably high. Viola and Jones spent weeks training
a detector with 6,060 features (weak learners) on a face
training set of 4,916.

To alleviate the training time bottleneck, a few approaches
have been proposed. Pham and Cham [10] reduced the training
time of weak learners by approximating the decision stumps
with class-conditional Gaussian distributions. Wu et al. [11]
introduced a fast implementation of the AdaBoost method
and proposed forward feature selection (FFS) for fast training.
FFS ignores the reweighting step in boosting so that weak
classifiers only need to be trained once. Xiao et al. [12]
applied distributed learning to learn their proposed dynamic
cascade framework. They used over 30 desktop computers
for parallel training and managed to train a face detector
on a training set with 500,000 positive samples and 10
billion negative samples in under 7 hours. However, these
techniques are not applicable to some real-world applications
where a complete set of training samples is often not given
in advance. Re-training the model each time new data arrive
would increase the time complexity by a factor of N, where
N is the number of newly arrived samples. Hence, developing
an efficient adaptive object detector has become an urgent
issue for many applications of object detection in diverse
and changing environments. Online incremental learning
algorithms have been proposed to address this problem.

Online learning was first introduced in the computational
learning community. Since boosting is one of the classifiers
that have been successfully applied to many machine learning
tasks, there has been considerable interest in applying boosting



IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. XX, NO. XX, MARCH 200X 



techniques to problems that require online learning. The
first online version of boosting algorithms was proposed in
[13]. The algorithm works by minimizing the classification
error while updating the weak classifiers online. Grabner and
Bischof [14] later applied online boosting to object detection
and visual tracking. Based on Oza and Russell's online boosting
[13], they proposed an online feature selection method,
where a group of selectors is initialized randomly, each with
its own feature pool. By interchanging weak learners based
on lowest classification error, the algorithm is able to capture
the change in patterns induced by new samples. Huang et al.
[15] proposed an incremental learning algorithm that adjusts
a boosted classifier with domain-partitioning weak hypotheses
to online samples. They showed that by incremental learning
with a few difficult unseen faces (e.g., faces with sunglasses or
extreme illumination), the performance of the online detector
can be significantly improved. Parag et al. [16] advocated
an online boosting algorithm where the parameters of the
weak classifiers are updated using a weighted linear regressor
to minimize the weighted least square error. In the context of
pedestrian detection, Liu and Yu [17] introduced a
gradient-based feature selection approach where the parameters
of the weak classifiers are updated using gradient descent to
minimize the weighted least square error. Nonetheless, most of
these proposed techniques concentrated on the application of
visual tracking or object classification with small training sets
and few online data sets.^1 Hence, to date, it remains unclear
whether there is any improvement in object detection from
continuously updating the existing models with a sufficiently
large training sample set. We address this question in
Section III.

Recently, Moghaddam et al. [18] presented a technique that
combines the greedy approach with the efficient block matrix
inverse formula. The proposed technique, termed greedy
sparse linear discriminant analysis (GSLDA), speeds up the
computation by a factor of 1000 compared with globally optimal
solutions found by branch-and-bound search in the case of
binary classification problems. Paisitkriangkrai et al. [19] later
applied the GSLDA algorithm to face detection and showed
very convincing results. Their GSLDA face detector has been
shown to outperform AdaBoost-based face detectors owing to
the nature of the training data (the distribution of face and
non-face samples is highly imbalanced). The objective of this
work is to design an efficient incremental greedy sparse LDA
algorithm that can accommodate new data efficiently while
preserving a promising classification performance.

Unlike classical LDA, for which many online learning
techniques have been proposed [20], [21], [22],
there are very few works on incremental learning for sparse
LDA. One of the difficulties might be that
the sparse LDA problem is non-convex and NP-hard, so it is not
straightforward to design an incremental solution for sparse
LDA. In this work, we design an algorithm that efficiently
learns and updates the sparse LDA classifier. Our online sparse
LDA classifier not only incorporates new data efficiently but

^For example, in flTI . the authors trained the initial classifier with 366 
positive samples and 2,540 negative patches, and incrementally updated it with
366 online positive samples and 2,450 online negative patches.



also yields an improvement in classification accuracy as new 
data become available. In brief, we extend the work of [19]
with an efficient online update scheme. Our method modifies 
the weights of linear discriminant functions to adapt to new 
data sets. This update process generalizes the weights of linear 
discriminant functions and results in accuracy improvements 
on test sets. 

The key contributions of this work are summarized as 
follows. 

• We propose an efficient incremental greedy sparse LDA 
classifier for training an object detector in an incremental 
fashion. The online algorithm integrates the GSLDA 
based feature selection with our adaptation schemes for 
updating weights of linear discriminant functions and the 
linear classifier threshold. Our updating algorithm is very
efficient. We neither replace nor discard any weak learners
during the updating phase.

• Our online GSLDA serves as a better (in terms of
performance) alternative to the standard online boosting
[13] for training detectors. To our knowledge, this is the
first application of the online sparse linear discriminant
analysis algorithm to object detection.

• Finally, we have conducted extensive experiments on 
several data sets that have been used in the literature. The 
experimental results confirm that incremental learning 
with online samples is beneficial to the initial classifier. 
Our algorithm can efficiently update the classifier when
a new instance is inserted while achieving classification
accuracy comparable to that of the batch algorithm.^2 Our
findings indicate that online learning plays a crucial role in
object detection, especially when the initial number of 
training samples is small. Note that when trained with 
few positive samples, the detector often under-performs 
since it fails to capture the appearance variations of the 
target objects. By applying our online technique, the 
classification performance can be further improved at the 
cost of a minor increase in training time. 

The rest of the paper is organized as follows. Section II
begins by introducing the concept of LDA and GSLDA. We
then propose our online GSLDA object detector. The results
of numerous experiments are presented in Section III. We
conclude the paper in Section IV.



II. Algorithms 

For ease of exposition, the symbols and their denotations
used in this paper are summarized in Table I. In this section,
we begin by introducing the basic concept of classical linear
discriminant analysis (LDA) and greedy sparse linear
discriminant analysis (GSLDA). We then propose our online greedy
sparse LDA (OGSLDA).

A. Classical Linear Discriminant Analysis 

Linear discriminant analysis (LDA) deals with the prob- 
lem of finding weights of linear discriminant functions. Let 

^We use the terms "batch learning" and "offline learning" interchangeably 
in this paper. 
TABLE I
Notation

Notation                          Description
$C_1, C_2$                        Class 1 (positive class), class 2 (negative class)
$N$                               Number of training samples in each classifier (cascade layer)
$N_1, N_2$                        The number of training samples in the first and second class, respectively
$M$                               The size of the feature sets (for decision stumps, this is also equal to the number of weak learners)
$T$                               The number of features to be selected
$X$                               Data matrix
$x$                               The new instance being inserted
$\bar{m}$                         The global mean of the training samples
$m_1, m_2$                        The mean (centroid) of the first and second class, respectively
$\Sigma_1, \Sigma_2$              The covariance of the first and second class
$\mu_1, \mu_2$                    The projected mean of the first and second class
$\sigma_1, \sigma_2$              The projected covariance of the first and second class
$S_b, \tilde{S}_b$                Between-class scatter matrix and its updated value after the new instance $x$ has been inserted
$S_w, \tilde{S}_w$                Within-class scatter matrix and its updated value
$w$                               Weights of linear discriminant functions (also referred to as weak learners' coefficients in the context)
$w_0$                             The linear classifier threshold

us assume that we have a set of training patterns $x =
[x_1, x_2, \ldots, x_M]^T$, each of which is assigned to one
of two classes, $C_1$ and $C_2$. We can find a weight vector
$w = [w_1, w_2, \ldots, w_M]^T$ and a threshold $w_0$ such that

$$ w^T x - w_0 > 0 \quad (x \in C_1), \qquad
   w^T x - w_0 < 0 \quad (x \in C_2). \qquad (1) $$

In general, we seek the vector $[w_0, w_1, w_2, \ldots, w_M]$ that best
satisfies (1). The data are said to be linearly separable if (1)
is satisfied for all $x$.

An intuitive objective that one can take is to find a linear 
combination of the variables that can separate the two classes 
as much as possible. The computed linear combination reduces 
the dimensions of the samples to one dimension. The classical 
criterion proposed by Fisher is the ratio of between-class to 
within-class variances, which can be written as 



$$ J = \frac{\sum_{c \in C} \sum_{x \in c} (w^T m_c - w^T \bar{m})^2}
            {\sum_{c \in C} \sum_{x \in c} (w^T x - w^T m_c)^2}
     = \frac{w^T S_b w}{w^T S_w w}, \qquad (2) $$

where

$$ S_b = \sum_{c \in C} N_c (m_c - \bar{m})(m_c - \bar{m})^T, \qquad (3) $$

$$ S_w = \sum_{c \in C} \sum_{x \in c} (x - m_c)(x - m_c)^T. \qquad (4) $$

Here, $m_c$ is the mean of class $c$, $\bar{m}$ is the global mean,
$N_c$ is the number of instances in class $c$, and $S_b$ and $S_w$ are the
so-called between-class and within-class scatter matrices. The
numerator of (2) denotes the distance between the projected
means and the denominator denotes the variance of the pooled
data. We want to find a linear projection $w$ that maximizes
$J$, i.e., the distance between the means of the two classes, while
minimizing the variance within each class. The solution can
be obtained by generalized eigen-decomposition (GEVD). The
optimal solution $w$ is the eigenvector corresponding to the
maximal eigenvalue and can be expressed as [23]:

$$ w \propto S_w^{-1}(m_1 - m_2). \qquad (5) $$
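As an illustration of (5), the following Python sketch computes the Fisher direction for a two-dimensional toy problem. It is a minimal example under stated assumptions rather than the implementation used in this paper: the function names (`class_mean`, `within_class_scatter_2d`, `fisher_direction_2d`, `project`) and the toy samples are hypothetical, and the 2 x 2 inverse is written in closed form so that the block needs no external library.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def class_mean(samples: Sequence[Sequence[float]]) -> Vector:
    """Centroid m_c of one class."""
    n = len(samples)
    return [sum(s[j] for s in samples) / n for j in range(len(samples[0]))]


def within_class_scatter_2d(classes: Sequence[Sequence[Sequence[float]]]) -> Matrix:
    """S_w of (4): sum over classes of sum_x (x - m_c)(x - m_c)^T, for 2-D data."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for samples in classes:
        m = class_mean(samples)
        for x in samples:
            d = [x[0] - m[0], x[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
    return s


def fisher_direction_2d(pos: Sequence[Sequence[float]],
                        neg: Sequence[Sequence[float]]) -> Vector:
    """w = S_w^{-1} (m_1 - m_2), i.e. (5) with the proportionality constant set to 1."""
    m1, m2 = class_mean(pos), class_mean(neg)
    s = within_class_scatter_2d([pos, neg])
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]  # assumed nonzero (S_w nonsingular)
    inv = [[s[1][1] / det, -s[0][1] / det],
           [-s[1][0] / det, s[0][0] / det]]
    d = [m1[0] - m2[0], m1[1] - m2[1]]
    return [inv[0][0] * d[0] + inv[0][1] * d[1],
            inv[1][0] * d[0] + inv[1][1] * d[1]]


def project(w: Sequence[float], x: Sequence[float]) -> float:
    """One-dimensional projection w^T x."""
    return sum(wi * xi for wi, xi in zip(w, x))


# Hypothetical toy data: two well-separated square clusters.
pos = [[2.0, 2.0], [3.0, 2.0], [2.0, 3.0], [3.0, 3.0]]
neg = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = fisher_direction_2d(pos, neg)
```

With this orientation of $w$ the positive class projects above the negative class, which is the convention assumed by the thresholding rule in (1).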



If we further assume that the data are normally distributed
and that the distributions in the original space have identical
covariance matrices, an optimal threshold, $w_0$, can be
calculated from

$$ w_0 = \frac{1}{2}(m_1 + m_2)^T S_w^{-1}(m_1 - m_2)
       - \log \frac{\Pr(C_1)}{\Pr(C_2)}. \qquad (6) $$

Here, $\Pr(C_1)$ and $\Pr(C_2)$ are the prior probabilities of classes
$C_1$ and $C_2$, respectively. This threshold can be interpreted as
the mid-point between the two projected means, shifted by
the log of the ratio between the prior probabilities of the two
classes.

B. Greedy Sparse Linear Discriminant Analysis 

In this section, we briefly present the offline implementation
of the greedy sparse LDA algorithm [24], [18]. The sparse
version of classical LDA is to solve

$$ \max_{w} \; \frac{w^T S_b w}{w^T S_w w}
   \quad \text{subject to} \quad \mathrm{Card}(w) = k, \qquad (7) $$

where $\mathrm{Card}(w) = k$ is an additional sparsity constraint,
$\mathrm{Card}(\cdot)$ denotes the $\ell_0$ norm, and $k$ is an integer set by the user.
Due to this additional sparsity constraint, the problem is
non-convex and NP-hard. In [24], Moghaddam et al. presented
a technique to compute optimal sparse linear discriminants
using a branch-and-bound approach. Nevertheless, finding the
exact globally optimal solutions for high-dimensional data is
infeasible. The algorithm was extended in [18] with new
sparsity bounds and efficient block matrix inverse techniques
to speed up the computation by a factor of 1000. The technique
works by sequentially adding the new variable which yields
the maximum eigenvalue (forward selection) until the number
of nonzero components, $\mathrm{Card}(w)$, is equal to the integer set
by the user.
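The forward selection step can be sketched as follows. This is a simplified illustration, not the algorithm of [18]: it recomputes the criterion from scratch for every candidate instead of using the block matrix inverse updates described above. It relies on the fact that for two classes $S_b$ has rank one, so the largest generalized eigenvalue on a feature subset is proportional to $d^T S_w^{-1} d$ restricted to that subset, with $d = m_1 - m_2$. The names `solve`, `separation` and `greedy_forward_selection` and the toy numbers are hypothetical.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def solve(a: Matrix, b: Sequence[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting (a assumed nonsingular)."""
    n = len(b)
    aug = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def separation(s_w: Matrix, d: Sequence[float], subset: Sequence[int]) -> float:
    """d_S^T (S_w)_SS^{-1} d_S on the feature subset S (two-class criterion up to a constant)."""
    sub = [[s_w[i][j] for j in subset] for i in subset]
    d_sub = [d[i] for i in subset]
    y = solve(sub, d_sub)
    return sum(p * q for p, q in zip(d_sub, y))


def greedy_forward_selection(s_w: Matrix, d: Sequence[float], k: int) -> List[int]:
    """Add, one at a time, the feature that maximizes the criterion until Card(w) = k."""
    selected: List[int] = []
    remaining = list(range(len(d)))
    while len(selected) < k:
        best = max(remaining, key=lambda j: separation(s_w, d, selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy problem with three uncorrelated features.
s_w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 4.0]]
d = [1.0, 3.0, 4.0]
chosen = greedy_forward_selection(s_w, d, 2)
```

In this toy case the single-feature scores are 1, 9 and 4, so feature 1 is picked first and feature 2 second. Recomputing the solve for every candidate costs far more than the incremental inverse used in practice, so the sketch is only meant to show the selection rule.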

In [19], Paisitkriangkrai et al. learn an object detector using
the GSLDA algorithm. The training procedure is described in
Algorithm 1. First, the set of selected features is initialized
to an empty set. The algorithm then trains all weak learners
and stores their results in a lookup table (lines 1-2). At every
round, the output of each weak learner is examined and the
weak learner that best separates the two classes is sequentially
added to the list (lines 3-4). Mathematically, Algorithm 1
sequentially selects the weak learner whose output yields the
maximal eigenvalue. Weak learners are added until the target
learning goal is met. The authors of [19] use an asymmetric
node learning goal to build a cascade of GSLDA object
detectors.

C. Incremental Learning of GSLDA Classifiers 

The major challenge of GSLDA object detectors in real- 
world applications is that a complete set of training samples 
is often not given in advance. As new data arrive, the
between-class and within-class scatter matrices, $S_b$ and $S_w$,
will change accordingly. In offline GSLDA, the value of both
matrices would have to be recomputed from scratch. However, this
Algorithm 1 The training procedure for building an offline
GSLDA object detector.

Input:

• A positive training set and a negative training set;

• A set of features $\{f_i; i \in [1, M]\}$;
Output:

• A set of selected weak learners $\{h_i; i \in [1, T]\}$ that best
separates the training set;

1 foreach feature do

2 Train a weak learner (e.g., a decision stump parameterized
by a threshold that results in the smallest classification
error) on the training set;

3 while the target goal is not met do

4 Add the best weak learner (e.g., decision stump) that yields
the maximum class separation to the set of selected weak
learners;



approach is unacceptable due to its heavy computation and 
storage requirements. First, the cost of computing both matri- 
ces grows with the number of training samples. As a result, 
the algorithm will run slower and slower as time progresses. 
Second, the batch approach uses the entire set of training data 
for each update. In other words, the previous training data 
needs to be stored for the retraining purpose. 

In order to overcome these drawbacks, we propose an 
online learning algorithm, termed online greedy sparse LDA 
(OGSLDA). The OGSLDA algorithm consists of two phases: 
the initial offline learning phase and the incremental learning 
phase. The training procedure in the initial phase is similar to
Algorithm 1. Here, we assume that
the number of training samples available initially is adequate 
and well represents the true density. In the second phase, 
the learned covariance matrices are updated in an incremental 
manner. 

It is important to point out that a number of incremental
LDA-like approximate algorithms have been proposed in
[21], [25]. Ye et al. [21] proposed an efficient LDA-based
incremental dimension reduction algorithm which applies QR
decomposition and QR-updating techniques for memory and
computation efficiency. Kim et al. [25] proposed an incremental
LDA by applying the concept of the sufficient spanning
set approximation in each update step. However, we did not
find any of the existing LDA-like algorithms appropriate to
our problem. Based on our preliminary experiments, the
projection matrix determined in the subspace often gives worse
discriminant power than that from the full space. This might be
due to their dimension reduction algorithms, which reduce the
between-class and within-class scatter matrices to a much
smaller size. Our online GSLDA is guaranteed to build the
same between-class and within-class scatter matrices as batch
GSLDA given the same training data. We need not worry about
large dimensions in our algorithm because applying sparse LDA
in our initial phase already reduces the number of dimensions
we have to deal with. Hence, given the same set of features, the
accuracy of our online GSLDA is better than that of the existing
incremental LDA-like approximate algorithms. The only expensive
computation left in our algorithm is eigen-analysis. In order to
avoid the high computational complexity of repeatedly solving
the generalized eigen-decomposition, we apply efficient matrix
inversion updating techniques based on the inverse
Sherman-Morrison formula. As a result, our incremental
algorithm is robust and efficient.

In this section, we first introduce an efficient method that 
incrementally updates both within-class and between-class 
scatter matrices as new observations arrive. Then, an approach 
used to update the classifier threshold is described. Finally, 
we analyze the storage and training time complexity of the 
proposed method. 

1) Incremental update of between-class and within-class
matrices: Since GSLDA assumes a Gaussian distribution, the
incremental update of class mean and class covariance can be
computed very quickly. The techniques used to update both
matrices can be easily derived. The procedure proceeds in
three steps:

1) Updating the between-class scatter matrix, $S_b$;

2) Updating the within-class scatter matrix, $S_w$;

3) Updating the inverse of the within-class scatter matrix, $S_w^{-1}$.

Updating between-class scatter matrix: The definition of the
between-class scatter matrix is given in (3). For two classes ($C_1$
and $C_2$), $S_b$ can be simplified to



$$ S_b = \frac{N_1 N_2}{N} (m_1 - m_2)(m_1 - m_2)^T. \qquad (8) $$
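For completeness, the simplification in (8) can be checked directly from (3). The short derivation below is a reader's sketch that uses only the definitions above, with $N = N_1 + N_2$ and the global mean written as $\bar{m} = (N_1 m_1 + N_2 m_2)/N$:

```latex
\begin{aligned}
m_1 - \bar{m} &= \frac{N_2}{N}(m_1 - m_2), \qquad
m_2 - \bar{m} = -\frac{N_1}{N}(m_1 - m_2), \\
S_b &= N_1 (m_1 - \bar{m})(m_1 - \bar{m})^T + N_2 (m_2 - \bar{m})(m_2 - \bar{m})^T \\
    &= \frac{N_1 N_2^2 + N_2 N_1^2}{N^2} (m_1 - m_2)(m_1 - m_2)^T
     = \frac{N_1 N_2}{N} (m_1 - m_2)(m_1 - m_2)^T .
\end{aligned}
```

The between-class scatter therefore depends on the data only through the two class means and the two class counts, which is why it can be refreshed in constant memory when a new sample arrives.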



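The expression can be interpreted as the scatter of class 1 with
respect to the scatter of class 2. Let $x$ be a new instance being
inserted. The updated means $\tilde{m}_1$ and $\tilde{m}_2$ can be calculated from

$$ \tilde{m}_1 = \begin{cases}
     m_1 + \dfrac{x - m_1}{N_1 + 1} & \text{if } x \in C_1; \\
     m_1 & \text{otherwise,}
   \end{cases}
   \qquad
   \tilde{m}_2 = \begin{cases}
     m_2 & \text{if } x \in C_1; \\
     m_2 + \dfrac{x - m_2}{N_2 + 1} & \text{otherwise.}
   \end{cases} \qquad (9) $$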
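The two updates in (8) and (9) can be sketched in a few lines of Python. This is an illustrative example only; the helper names `update_mean` and `between_class_scatter` and the toy numbers are hypothetical and do not come from the paper.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]


def update_mean(mean: Sequence[float], count: int,
                x: Sequence[float]) -> Tuple[Vector, int]:
    """Running-mean update of (9) for the class that receives the new instance x."""
    new_count = count + 1
    new_mean = [m + (xi - m) / new_count for m, xi in zip(mean, x)]
    return new_mean, new_count


def between_class_scatter(m1: Sequence[float], n1: int,
                          m2: Sequence[float], n2: int) -> Matrix:
    """Two-class between-class scatter of (8): (N1 N2 / N) (m1 - m2)(m1 - m2)^T."""
    scale = n1 * n2 / (n1 + n2)
    d = [a - b for a, b in zip(m1, m2)]
    return [[scale * di * dj for dj in d] for di in d]


# Hypothetical state: 3 positives with mean (1, 2) and 4 negatives with mean (0, 0).
m1, n1 = [1.0, 2.0], 3
m2, n2 = [0.0, 0.0], 4

# A new positive sample arrives; only the positive-class mean and count change.
m1, n1 = update_mean(m1, n1, [5.0, 6.0])
s_b = between_class_scatter(m1, n1, m2, n2)
```

Only the class means and counts are kept between updates, so no earlier sample has to be stored to refresh $S_b$.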



Updating within-class scatter matrix: The covariance of a
random vector $X$ is a square matrix $\Sigma$ where $\Sigma = E[(X -
E[X])(X - E[X])^T]$. Given the new instance $x$, the updated
covariance matrix is given by

$$ \tilde{\Sigma} = ([X, x] - \tilde{m} 1^T)([X, x] - \tilde{m} 1^T)^T. \qquad (10) $$

Here, $\tilde{m}$ is the updated mean after the new instance has been
inserted and $1$ is a column vector with each entry being 1.
Its dimensionality should be clear from the context. Note that
in (10), we leave out the constant term since it makes no
difference to the final solution. We have

$$ [X, x] - \tilde{m} 1^T = [X, x] - m 1^T + m 1^T - \tilde{m} 1^T
   = [X - m 1^T, \; x - m] - (\tilde{m} - m) 1^T. $$
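Substituting the above expression into (10) and letting $u = x - m$
and $v = \tilde{m} - m$, we obtain

$$ \begin{aligned}
\tilde{\Sigma} &= ([X - m 1^T, u] - v 1^T)([X - m 1^T, u] - v 1^T)^T \\
&= [X - m 1^T, u][X - m 1^T, u]^T - [X - m 1^T, u](v 1^T)^T \\
&\quad - (v 1^T)[X - m 1^T, u]^T + (v 1^T)(v 1^T)^T \\
&= \Sigma + u u^T - [X - m 1^T, u] 1 v^T \\
&\quad - v ([X - m 1^T, u] 1)^T + (N + 1) v v^T \\
&= \Sigma + u u^T - (N m - N m + u) v^T \\
&\quad - v (N m - N m + u)^T + (N + 1) v v^T \\
&= \Sigma + u u^T - u v^T - v u^T + (N + 1) v v^T \\
&= \Sigma + (u - v)(u - v)^T + N v v^T \\
&= \Sigma + (x - \tilde{m})(x - \tilde{m})^T + N (\tilde{m} - m)(\tilde{m} - m)^T.
\end{aligned} \qquad (11) $$

Note that $X 1 = m 1^T 1 = N m$. Next, we consider updating the
within-class scatter matrix. Let $x$ be a new instance being
inserted. The updated matrix, $\tilde{S}_w$, can be calculated from

$$ \tilde{S}_w = \begin{cases}
     \tilde{\Sigma}_1 + \Sigma_2 & \text{if } x \in C_1; \\
     \Sigma_1 + \tilde{\Sigma}_2 & \text{otherwise.}
   \end{cases} \qquad (12) $$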
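The rank-two update in (11) can be checked numerically against a batch recomputation. The Python sketch below is a hypothetical illustration (the names `batch_scatter` and `update_scatter` and the sample values are assumptions); it works with the unnormalized scatter, matching the convention of dropping the constant term in (10).

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]


def batch_scatter(samples: Sequence[Sequence[float]]) -> Tuple[Matrix, Vector]:
    """Unnormalized scatter sum_x (x - m)(x - m)^T and the mean m, computed from scratch."""
    n, dim = len(samples), len(samples[0])
    mean = [sum(s[j] for s in samples) / n for j in range(dim)]
    sigma = [[sum((s[i] - mean[i]) * (s[j] - mean[j]) for s in samples)
              for j in range(dim)] for i in range(dim)]
    return sigma, mean


def update_scatter(sigma: Matrix, mean: Sequence[float], count: int,
                   x: Sequence[float]) -> Tuple[Matrix, Vector, int]:
    """Update of (11): Sigma + (x - m~)(x - m~)^T + N (m~ - m)(m~ - m)^T."""
    new_mean = [m + (xi - m) / (count + 1) for m, xi in zip(mean, x)]
    u = [xi - nm for xi, nm in zip(x, new_mean)]   # x - m~
    v = [nm - m for nm, m in zip(new_mean, mean)]  # m~ - m
    dim = len(x)
    new_sigma = [[sigma[i][j] + u[i] * u[j] + count * v[i] * v[j]
                  for j in range(dim)] for i in range(dim)]
    return new_sigma, new_mean, count + 1


# Hypothetical class with four stored samples and one newly arriving sample.
old = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
new_x = [6.0, 1.0]

sigma, mean = batch_scatter(old)
inc_sigma, inc_mean, inc_count = update_scatter(sigma, mean, len(old), new_x)
ref_sigma, ref_mean = batch_scatter(old + [new_x])  # batch reference for comparison
```

The incremental result needs only the previous scatter, mean and count of the class, whereas the batch reference needs every stored sample.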



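Updating inverse of within-class scatter matrix: As mentioned
in [18], the computational complexity of two-class
GSLDA relies heavily on the calculation of the within-class scatter
matrix inverse. In order to update the matrix inverse
efficiently, we make use of the technique called Balanced
Incomplete Factorization, which is based on the inverse
Sherman-Morrison formula proposed by Bru et al. in [26]. Let $\Sigma$ be a
square matrix of size $M \times M$ which can be written as

$$ \Sigma = \Sigma_0 + p_1 q_1^T + p_2 q_2^T. \qquad (13) $$

Here, we assume that $\Sigma_0$ is nonsingular and $p_1, p_2, q_1, q_2 \in
\mathbb{R}^M$. The inverse of $\Sigma$ is given by

$$ \Sigma^{-1} = \Sigma_0^{-1} - \Sigma_0^{-1} U D^{-1} V^T \Sigma_0^{-1}, \qquad (14) $$

where

$$ D = \begin{bmatrix} r_1 & 0 \\ 0 & r_2 \end{bmatrix}, \quad
   U = \left[ p_1, \; p_2 - \frac{q_1^T \Sigma_0^{-1} p_2}{r_1} p_1 \right], \quad
   V = \left[ q_1, \; q_2 - \frac{q_2^T \Sigma_0^{-1} p_1}{r_1} q_1 \right], $$

$$ r_1 = 1 + q_1^T \Sigma_0^{-1} p_1, \quad
   r_2 = 1 + \left( q_2 - \frac{q_2^T \Sigma_0^{-1} p_1}{r_1} q_1 \right)^T \Sigma_0^{-1} p_2. $$

The updated inverse of the within-class scatter matrix can be
written as

$$ \tilde{S}_w^{-1} = S_w^{-1} - S_w^{-1} U D^{-1} V^T S_w^{-1}, \qquad (15) $$

where $p_1 = q_1 = x - \tilde{m}$, $p_2 = N_c(\tilde{m} - m)$ and $q_2 = \tilde{m} - m$
(from (11) and (13)). Here $m$, $\tilde{m}$ and $N_c$ refer to the class $c$ to
which $x$ belongs.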
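The inverse update of (13) and (14) amounts to two successive Sherman-Morrison corrections. The following Python sketch illustrates this for a small matrix; it is an assumption-laden example rather than the authors' code, and the names `rank_two_inverse_update`, `mat_vec`, `vec_mat`, `dot` and the toy matrices are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def mat_vec(a: Matrix, x: Sequence[float]) -> Vector:
    """Matrix-vector product a x."""
    return [dot(row, x) for row in a]


def vec_mat(x: Sequence[float], a: Matrix) -> Vector:
    """Row-vector times matrix, x^T a."""
    return [sum(x[i] * a[i][j] for i in range(len(x))) for j in range(len(a[0]))]


def rank_two_inverse_update(inv0: Matrix, p1: Sequence[float], q1: Sequence[float],
                            p2: Sequence[float], q2: Sequence[float]) -> Matrix:
    """Inverse of Sigma_0 + p1 q1^T + p2 q2^T given inv0 = Sigma_0^{-1}, following (14)."""
    a_p1 = mat_vec(inv0, p1)                                    # Sigma_0^{-1} p1
    a_p2 = mat_vec(inv0, p2)                                    # Sigma_0^{-1} p2
    r1 = 1.0 + dot(q1, a_p1)
    u2 = [b - dot(q1, a_p2) / r1 * a for a, b in zip(p1, p2)]   # second column of U
    v2 = [b - dot(q2, a_p1) / r1 * a for a, b in zip(q1, q2)]   # second column of V
    r2 = 1.0 + dot(v2, a_p2)
    left1, right1 = a_p1, vec_mat(q1, inv0)
    left2, right2 = mat_vec(inv0, u2), vec_mat(v2, inv0)
    n = len(inv0)
    return [[inv0[i][j] - left1[i] * right1[j] / r1 - left2[i] * right2[j] / r2
             for j in range(n)] for i in range(n)]


# Hypothetical example: Sigma_0 = [[2, 1], [1, 2]] with a known inverse.
inv0 = [[2.0 / 3.0, -1.0 / 3.0], [-1.0 / 3.0, 2.0 / 3.0]]
p1 = q1 = [1.0, 1.0]
p2, q2 = [2.0, 0.0], [1.0, 0.0]

# The updated matrix is [[5, 2], [2, 3]], whose inverse is (1/11) [[3, -2], [-2, 5]].
new_inv = rank_two_inverse_update(inv0, p1, q1, p2, q2)
```

Every step is a matrix-vector product or an outer product, so the cost is quadratic in the matrix size rather than cubic as for a fresh inversion. The sketch assumes $r_1$ and $r_2$ are nonzero.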

2) Updating weak learners' coefficients and threshold:
Given the updated inverse within-class matrix, $\tilde{S}_w^{-1}$, and
the updated between-class matrix, $\tilde{S}_b$, the updated weights of
the linear discriminant functions can now be calculated by
matrix-vector multiplication using (5). To complete the linear
classifier, the threshold $w_0$ has to be obtained. Three criteria
can be adopted. The first criterion is to apply the optimal
Bayesian classifier in the projected space. In other words, the
selected threshold should be the value at which the
one-dimensional density functions on the projected line are
equal. The mean and variance in the transformed space can be
calculated as

$$ \mu_c = w^T m_c, \qquad \sigma_c^2 = w^T \Sigma_c w. \qquad (16) $$



Let $X_1 \sim \mathcal{N}(\mu_1, \sigma_1^2)$ and $X_2 \sim \mathcal{N}(\mu_2, \sigma_2^2)$. The
optimal threshold is the point at which the one-dimensional
density functions of the two classes are equal, i.e.,
$\log \Pr(x_1) = \log \Pr(x_2)$. After some algebraic expansion
and simplification, we can write this condition as the
second-order polynomial

$$ a x^2 + b x + c = 0, $$

where $a = \frac{1}{2\sigma_2^2} - \frac{1}{2\sigma_1^2}$,
$b = \frac{\mu_1}{\sigma_1^2} - \frac{\mu_2}{\sigma_2^2}$, and
$c = \frac{\mu_2^2}{2\sigma_2^2} + \log(\sigma_2) - \frac{\mu_1^2}{2\sigma_1^2} - \log(\sigma_1)$.
The quadratic has two roots,

$$ x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}. $$

In our implementation, we choose the threshold, $w_0$, to be
the value between the two class means:

$$ w_0 = x \quad \text{where } \mu_1 < x < \mu_2. \qquad (17) $$
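The first criterion can be sketched in Python as follows. This is a hypothetical helper, not code from the paper: the name `bayes_threshold` is an assumption, the equal-variance case (where the quadratic degenerates to a linear equation) is handled separately, and the function returns `None` when no root lies strictly between the two projected means. The root is searched between the means regardless of which mean is larger.

```python
import math
from typing import List, Optional


def bayes_threshold(mu1: float, sigma1: float,
                    mu2: float, sigma2: float) -> Optional[float]:
    """Point between the projected means where the two Gaussian densities are equal."""
    a = 1.0 / (2.0 * sigma2 ** 2) - 1.0 / (2.0 * sigma1 ** 2)
    b = mu1 / sigma1 ** 2 - mu2 / sigma2 ** 2
    c = (mu2 ** 2 / (2.0 * sigma2 ** 2) + math.log(sigma2)
         - mu1 ** 2 / (2.0 * sigma1 ** 2) - math.log(sigma1))
    lo, hi = min(mu1, mu2), max(mu1, mu2)
    roots: List[float]
    if abs(a) < 1e-12:
        # Equal variances: the condition reduces to b x + c = 0 (the mid-point).
        roots = [-c / b] if b != 0.0 else []
    else:
        disc = b * b - 4.0 * a * c
        if disc < 0.0:
            return None
        root = math.sqrt(disc)
        roots = [(-b + root) / (2.0 * a), (-b - root) / (2.0 * a)]
    for r in roots:
        if lo < r < hi:
            return r
    return None


# Hypothetical projected statistics: a tight positive class and a broad negative class.
w0 = bayes_threshold(4.0, 1.0, 0.0, 2.0)
w0_equal = bayes_threshold(2.0, 1.0, 0.0, 1.0)  # equal variances give the mid-point
```

With equal variances the threshold is the mid-point of the two means; with unequal variances it moves toward the class with the smaller variance.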



The second criterion is to choose the threshold which
yields a high detection rate with a moderate false alarm rate. This
asymmetric criterion is often adopted in the cascade framework
[8]. Let $\Phi(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} \exp(-\frac{u^2}{2})\, du$ be the cumulative
distribution function (CDF) of the standard normal random
variable $Z$. If $X \sim \mathcal{N}(\mu_1, \sigma_1^2)$, the CDF of $X$ is $\Phi(Z)$ where
$Z = \frac{x - \mu_1}{\sigma_1}$. Let the miss rate be $p$. The threshold which yields
a detection rate of $1 - p$ can be calculated as

$$ w_0 = \mu_1 + Z \sigma_1 = \mu_1 + \Phi^{-1}(p) \sigma_1. \qquad (18) $$
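A minimal Python sketch of (18) is given below, using the inverse normal CDF from the standard library (`statistics.NormalDist`, available from Python 3.8). The function name `detection_threshold` and the numeric values are hypothetical, and `miss_rate` must lie strictly between 0 and 1.

```python
from statistics import NormalDist


def detection_threshold(mu1: float, sigma1: float, miss_rate: float) -> float:
    """Threshold of (18): mu_1 + Phi^{-1}(p) sigma_1 for a target miss rate p."""
    return mu1 + NormalDist().inv_cdf(miss_rate) * sigma1


# Hypothetical projected positive-class statistics and a 1% target miss rate.
mu1, sigma1 = 3.0, 1.0
w0 = detection_threshold(mu1, sigma1, 0.01)

# Fraction of the positive class expected to fall below the threshold.
achieved_miss_rate = NormalDist(mu1, sigma1).cdf(w0)
```

Because $\Phi^{-1}(p)$ is negative for $p < 0.5$, a small target miss rate places the threshold below the projected positive mean.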



The last criterion is to set the threshold to be the projected
mean of the negative class. This threshold helps us ensure
the target asymmetric learning goal (moderate (50%) false
positive rate with high detection rate). The threshold for the
last criterion is

$$ w_0 = \mu_2. \qquad (19) $$



The above three threshold updating rules might look
oversimple. However, in [27], a few numerical simulations were
performed on multi-dimensional normally distributed classes
and real-life data taken from the UCI machine learning repository.
It is reported that selecting the threshold using the simple approach
in (17) often leads to a smaller classification error than the
traditional Fisher's approach (6).

Many online boosting algorithms modify the
parameters of the weak learners to adapt to new data. For
example, in [14], the parameters of the weak learners are
updated using Kalman filtering; Parag et al. [16] updated the
parameters using linear regression; Liu and Yu [17] updated
the parameters using gradient descent, etc. We have found that
extreme care has to be taken when updating weak
learners' parameters for object detection. To
demonstrate this, we generate an artificial asymmetric data set
similar to the one used in [28]. We then learn two different
incremental linear weak classifiers with different parameter
updating schemes:

1) Incrementally update the model based on a Gaussian
distribution, similar to [14];

2) Incrementally update linear coefficients and intercept to
minimize least square error (LSE) using linear regression,
similar to [16] (here, we assume uniform sample
weights).

In this experiment, each weak learner represents a linear 
function with different coefficients (slopes). Each weak learner 
has one updatable parameter, i.e., linear classifier threshold 
(intercept). We apply GSLDA algorithm and select the weak 
learner with minimal classification error. Based on the selected 
weak learner, we continuously insert new samples and update 
the linear classifier threshold. Fig. 1 plots 9 different linear
classifier thresholds. Top row shows the linear classifier with 
no parameter updating. Middle row shows the linear classifier 
with Gaussian updating rule. Bottom row shows the linear 
classifier using the linear regression algorithm. The first col- 
umn shows the classifier thresholds on the initial training set. 
The middle and last columns show the classifier thresholds 
with new data being inserted. We found that the top two 
classifier thresholds (no update and Gaussian) perform very 
similarly. LSE seems to perform worse when more new data 
are inserted. The reason may be attributed to the asymmetry of 
the data. When the data are linearly separable, we can see that 
the regressor works very well. Based on our results, we feel 
that parameter updating algorithms could significantly weaken 
the performance of weak learners if not applied properly. 
Hence, in this work, we decide not to update the parameters of 
the weak learners in our algorithm. Clearly, another benefit
of not updating the weak learners' parameters is faster
computation.

The online GSLDA framework is summarized in Algorithm 2.
Note that here we only use the forward search of the original
GSLDA algorithm of [24], [18]. In [19], we have shown
that forward selection plus backward elimination improves the
detection performance slightly but with extra computation.

3) Incremental Learning Computational Complexity:
Since the initial training of online GSLDA is the same as
offline GSLDA, we briefly explain the time complexity of
GSLDA [19]. Let us assume we choose decision stumps
as our weak learners. Let the number of training samples
be $N$. Finding an optimal threshold for each feature needs
$O(N \log N)$ time.^3 Assume that the size of the feature set is $M$. The
time complexity for training weak learners is $O(MN \log N)$.
During GSLDA learning, we need to find the mean, $O(N)$,
variance, $O(N)$, and correlation, $O(T^2)$, for each feature. Since
we have $M$ features and the number of weak learners to be
selected is $T$, the total time complexity for offline GSLDA is
$O(MN \log N + MNT + MT^2)$.

Given the selected set of weak learners, the time complexity
of online GSLDA when a new instance is inserted can be
calculated as follows. Since the number of weak learners

^One usually sorts the 1D features using Quicksort, which has a complexity of
$O(N \log N)$.



Algorithm 2 The online GSLDA algorithm.

Given:

• The initial set of weak learners $\{h_i; i \in [1, T]\}$ trained using
offline GSLDA on small initial data;

Input:

• New training datum $I$ and its corresponding class label
$y \in \{1, 2\}$;

• The current between-class covariance matrix, $S_b$;

• The inverse of the within-class covariance matrix, $S_w^{-1}$;
Output:

• The updated between-class covariance matrix, $\tilde{S}_b$;

• The updated inverse of the within-class covariance matrix, $\tilde{S}_w^{-1}$;

• The updated weak learners' coefficients, $w$;

• The classifier threshold, $w_0$;

1 Classify the new datum $I$ using the given weak learners,
$x = [h_1(I), h_2(I), \ldots, h_T(I)]$;

2 Update $S_b$ with $x$ using (8) and (9);

3 Update $S_w^{-1}$ using (15);

4 Recalculate weak learners' coefficients, $w$, using (5);

5 Update classifier threshold, $w_0$, based on the node learning goal
((17) for minimal classification error, or min((18), (19)) for the
asymmetric node learning goal (see Section II-C2));



is $T$, the total time complexity to calculate $x$ in Step 1
is $O(T)$. It also takes $O(T)$ to update the class mean in
Step 2. At Step 3, calculating $U$, $V$, $r_1$, $r_2$ takes $O(T^2)$. In
this step, the order in which we calculate the matrix-matrix
multiplication affects the overall efficiency. Since we are
dealing with a small matrix chain multiplication, it is possible
to go through each possible order and pick the most efficient
one. For (14), we perform matrix-matrix multiplication in the
following order: $((\Sigma_0^{-1} U) D^{-1})(V^\top \Sigma_0^{-1})$. The number of
operations required to compute $(\Sigma_0^{-1} U)$ is $O(T \times T \times 2)$,
$((\Sigma_0^{-1} U) D^{-1})$ is $O(T \times 2 \times 2)$, $(V^\top \Sigma_0^{-1})$ is $O(2 \times T \times T)$
and $((\Sigma_0^{-1} U) D^{-1})(V^\top \Sigma_0^{-1})$ is $O(T \times 2 \times T)$. Hence, the
complexity of updating the matrix inverse is still in the order
of $O(T^2)$. Since the size of the within-class matrix is $T \times T$, the
matrix-vector multiplication in Step 4 takes $O(T^2)$. Updating
the classifier threshold in Step 5 takes $O(T^2)$ for the first criterion
(first, we find the projected mean and covariance, $O(T)$ and
$O(T^2 + T)$, respectively; then, we solve a closed-form second-degree
polynomial). The second criterion in Step 5 takes
$O(T^2)$ (again, the time complexity of the projected mean and
covariance is $O(T)$ and $O(T^2 + T)$). The third criterion in
Step 5 takes $O(T)$ (here, we only have to calculate the dot
product of two vectors). Hence, the time complexity of Step 5
is at most $O(T^2)$. Therefore, the total time complexity for
online GSLDA with the insertion of a new instance is at
most $O(\underbrace{N_0 M \log N_0 + N_0 M T + M T^2}_{\text{Offline}} + \underbrace{T^2}_{\text{Online}})$. Here, $N_0$ is
the number of initial training samples, which is assumed to be
small. Note that the speed-up of online GSLDA over batch
GSLDA is noticeable, i.e., $\underbrace{O(NT^2)}_{\text{Online}} \ll \underbrace{O(N^2 \log N)}_{\text{Batch}}$, when
more instances are inserted into the training set ($N \gg N_0$).
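The multiplication order discussed for Step 3 can be illustrated with a generic low-rank update of a matrix inverse. The sketch below assumes that the update has the Woodbury form $(\Sigma_0 + U C V^\top)^{-1} = \Sigma_0^{-1} - ((\Sigma_0^{-1} U) D^{-1})(V^\top \Sigma_0^{-1})$ with $D = C^{-1} + V^\top \Sigma_0^{-1} U$, where $U$ and $V$ are $T \times 2$ and $C$ is $2 \times 2$. Since the exact definitions of $U$, $V$ and $D$ are not restated in this section, this form and all helper names (`matmul`, `inv2`, `woodbury_update`) are assumptions made for illustration only.

```python
def matmul(a, b):
    """Plain matrix product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def inv2(m):
    """Inverse of a 2 x 2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def woodbury_update(sigma0_inv, u, v, c):
    """Return (Sigma0 + U C V^T)^{-1} given Sigma0^{-1}, without a T x T inversion."""
    s_u = matmul(sigma0_inv, u)                  # T x 2, O(T * T * 2)
    vt_s = matmul(transpose(v), sigma0_inv)      # 2 x T, O(2 * T * T)
    small = matmul(vt_s, u)                      # 2 x 2, O(T)
    c_inv = inv2(c)
    # D = C^{-1} + V^T Sigma0^{-1} U is only 2 x 2, so inverting it is O(1).
    d = [[c_inv[i][j] + small[i][j] for j in range(2)] for i in range(2)]
    s_u_dinv = matmul(s_u, inv2(d))              # T x 2, O(T * 2 * 2)
    correction = matmul(s_u_dinv, vt_s)          # T x T, O(T * 2 * T)
    return [[s - k for s, k in zip(s_row, k_row)]
            for s_row, k_row in zip(sigma0_inv, correction)]
```

Every product in this chain has at least one dimension equal to 2, which is why the whole update stays in $O(T^2)$ instead of the $O(T^3)$ of a fresh inversion.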

In terms of memory usage, the between-class scatter matrix
takes up $O(2T)$. The inverse of the within-class scatter matrix
occupies $O(T^2)$. For the first and second criteria in Step 5, we
also need to keep the covariance matrices $\Sigma_1$ and $\Sigma_2$, which
take up $O(2T^2)$. Hence, the extra memory requirements for
online GSLDA are at most $O(3T^2 + 2T)$. Given that the
selected number of weak classifiers in each cascade layer is
often small ($T < 200$), the time and memory complexity of
online GSLDA is almost negligible.

Fig. 1. Toy data set. x's and o's represent positive and negative samples, respectively. Top row: No update. The parameters of weak learners do not get
updated. Middle row: Gaussian model. Linear classifier threshold is calculated from the updated mean and variance. Bottom row: Least square
error. Linear classifier threshold is updated using linear regression. The leftmost column shows the classifier thresholds on the initial training set (50 positive
and 50 negative training points). The middle column shows the classifier thresholds with 25 new positive and 25 new negative points inserted. The rightmost
column shows the thresholds with 50 new positive and 50 new negative points inserted. Due to the asymmetry of the data distributions, updating the parameters of
the weak learners could result in performance deterioration.

III. Experiments 

This section is organized as follows. The datasets used in 
this experiment, including how the performance is analyzed, 
are described. Experiments and the parameters used are then 
discussed. Finally, experimental results and analysis of differ- 
ent techniques are presented. 

A. USPS Digits Classification 

We compare online GSLDA against batch GSLDA for
classification of the 16 × 16 pixel USPS digits '3' and '5'.
The data set consists of 406 training instances and 418 test
instances for the digit '3', and 361 training instances and 355 test
instances for the digit '5' [29]. We use the raw intensity values as
the features. Hence, the total number of features is 256. For
batch learning, we apply a greedy approach to sequentially
select the feature which yields maximal class separation (forward
search). We then evaluate the performance of the classifier
on the given test set and measure the error rate [18]. For
online learning, we randomly select 30/50/70 percent of the training
samples as the initial training set. Incremental updating is performed
with the remaining training instances being inserted one at
a time. We use decision stumps as the weak learners for
both classifiers. All experiments, except batch GSLDA (trained
with full training sets), are run 10 times. The mean of the
classification errors is plotted.

Figs. 2(a), 2(b) and 2(c) show the classification
error rates achieved by batch GSLDA and online GSLDA. In the
figures, the horizontal axis shows the $\ell_0$ norm of the feature
coefficients, i.e., the number of weak classifiers, and the
vertical axis indicates the classification error rate on test data.
We observe a trend that the error rate decreases when we
train with more training instances. It is important to point
out that in this experiment the error rate of online GSLDA
is quite close to that of batch GSLDA. We also train offline
GSLDA classifiers with 30%, 50% and 70% of the training data. We
observe an increase in the error rates of GSLDA (30% training
data) when the number of dimensions increases. This is not
surprising since it is quite common for a classifier to overfit
with large dimensions and a small sample size.

Fig. 2. Top: Classification error rates by offline GSLDA and online GSLDA on 16 × 16 pixel USPS digits data sets [29]. The number of initial training
data for online GSLDA is (a) 30%, (b) 50%, (c) 70% of the available training data. All experiments, except batch GSLDA (trained with full training sets),
are run 10 times. The mean of the errors is plotted. Bottom: Classification error rates by online GSLDA and online boosting [13]. The number of initial
training data is (d) 30%, (e) 50%, (f) 70% of the available training data. All experiments are run 10 times.

We compare the performance of online GSLDA with the online
boosting proposed in [13]. For each weak classifier, we build
a model by estimating the univariate normal distribution with
weighted mean and variance for digits '3' and '5'. We update
the weak classifier by incrementally updating the mean and
variance using weighted versions of the incremental update equations (cf. (10)). The results
of online boosting are shown in Figs. 2(d), 2(e) and 2(f). The
test error of online boosting decreases as the initial number of
training samples increases. We observe that the performance of
online boosting is markedly worse than the performance
of online GSLDA.
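The incremental statistics behind such a Gaussian weak classifier can be sketched as follows. The class below keeps a weighted running mean and variance of a single feature using West's weighted variant of Welford's update. Whether this coincides with the exact update equations used for the online boosting baseline is an assumption, and the names `WeightedGaussian` and `classify` are illustrative only.

```python
import math


class WeightedGaussian:
    """Running weighted mean and variance of a 1D feature for one class."""

    def __init__(self):
        self.w_sum = 0.0  # total sample weight seen so far
        self.mean = 0.0
        self.s = 0.0      # weighted sum of squared deviations from the mean

    def update(self, x, weight=1.0):
        """Insert one sample in O(1) without storing earlier samples."""
        self.w_sum += weight
        delta = x - self.mean
        self.mean += (weight / self.w_sum) * delta
        self.s += weight * delta * (x - self.mean)

    @property
    def variance(self):
        return self.s / self.w_sum if self.w_sum > 0 else 0.0

    def log_density(self, x, eps=1e-6):
        var = self.variance + eps  # eps guards against a zero variance
        return -0.5 * (math.log(2 * math.pi * var) + (x - self.mean) ** 2 / var)


def classify(x, model_a, model_b):
    """Return 'a' if x is at least as likely under model_a as under model_b."""
    return 'a' if model_a.log_density(x) >= model_b.log_density(x) else 'b'
```

In a boosting setting the `weight` argument would carry the current importance weight of the inserted sample, so heavily weighted samples pull the class model more strongly.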

Figs. 3(a) and 3(c) show the classification error
rates achieved by batch GSLDA and online GSLDA with 25 and 100
dimensions (features). In the figures, the horizontal axis shows
the portion of training data instances and the vertical axis
indicates the classification error rate. We observe a trend that
the error rate decreases when more and more training data
instances are involved, as expected. Online GSLDA not only
performs well on this dataset but is also very efficient. We
give a comparison of the computation cost between batch
GSLDA and incremental GSLDA in Figs. 3(b) and 3(d).
As can be seen, the execution time of online GSLDA is
significantly smaller than that of batch GSLDA as the number
of training samples grows.



B. Frontal Face Detection 

Due to their efficiency, Haar-like rectangle features [8] have
become a popular choice as image features in the context of
face detection. Similar to the work in [8], the weak learning
algorithm known as decision stumps and Haar-like rectangle
features are used here due to their simplicity and efficiency.
The following experiments compare the offline GSLDA and online
GSLDA learning algorithms.

1) Performances on Single-node Classifiers: We conduct
two experiments in this section. The first experiment compares
single strong classifiers learned using AdaBoost [8], AsymBoost
[28], offline GSLDA [19] and our proposed online
GSLDA algorithms. The datasets consist of 1,000 mirrored
face examples (Fig. 6) and 10,000 bootstrapped non-face
examples. The faces were cropped and rescaled to images of
size 24 × 24 pixels. For non-face examples, we initially select
1,000 random non-face patches from non-face images. The
other 9,000 non-face patches are added to the initial pool of
training data by bootstrapping.²

We train three offline face detectors using AdaBoost, AsymBoost
and GSLDA. Each classifier consists of 200 weak
classifiers. The classifiers are tested on two challenging face
videos, the David Ross indoor data set and the trellis data set,³ which
are publicly available on the internet. Both videos contain large
lighting variation, cast shadows, unknown camera motion, and
tilted faces with in-plane and out-of-plane rotation. The first
video contains 761 frames of a person moving from a dark to
a bright area. Since the first few video frames have very low
contrast (it is almost impossible to see faces), we ignore the first
100 frames. The second video contains 501 frames of a person
moving underneath a trellis with large illumination change and
cast shadows.

²We incrementally construct new non-face samples using a trained classifier of [8].

³http://www.cs.toronto.edu/~dross/ivt/

Fig. 3. Comparison of classification error rate and computation cost between online GSLDA and batch GSLDA on 16 × 16 pixel USPS digits data sets
[29]. We set the number of nonzero components of the feature coefficients ($\ell_0$ norm) to 25 (a,b) and 100 (c,d).

In this experiment, we use the scanning window technique
to locate faces. We set the scaling factor to 1.2 and the window
shifting step to 1. The patch with the highest classification score is
classified as a face. In other words, there is only one selected
face in each frame. A criterion similar to the one used in the
PASCAL VOC Challenge [30] is adopted here. Detections are
considered true or false positives based on the area of overlap
with ground truth bounding boxes. To be considered a correct
detection, the area of overlap between the predicted bounding
box, $B_p$, and the ground truth bounding box, $B_{gt}$, must exceed
50% by the formula:

$$\frac{\mathrm{area}(B_p \cap B_{gt})}{\mathrm{area}(B_p \cup B_{gt})} > 50\%.$$
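The overlap test can be written in a few lines. The sketch below assumes axis-aligned boxes given as `(x_min, y_min, x_max, y_max)` tuples in continuous coordinates; this box representation and the function names are choices made for the example, not part of the evaluation code used in the experiments.

```python
def overlap_ratio(box_p, box_gt):
    """Area of intersection divided by area of union of two boxes."""
    ix_min, iy_min = max(box_p[0], box_gt[0]), max(box_p[1], box_gt[1])
    ix_max, iy_max = min(box_p[2], box_gt[2]), min(box_p[3], box_gt[3])
    # Width and height of the intersection are clamped at zero for disjoint boxes.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_gt = (box_gt[2] - box_gt[0]) * (box_gt[3] - box_gt[1])
    union = area_p + area_gt - inter
    return inter / union if union > 0 else 0.0


def is_correct_detection(box_p, box_gt, threshold=0.5):
    """A detection counts as correct when the overlap ratio exceeds the threshold."""
    return overlap_ratio(box_p, box_gt) > threshold
```

For example, a 10 × 10 prediction shifted by half its width against a 10 × 10 ground truth box has an overlap ratio of 1/3 and is therefore rejected.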

For online GSLDA, the predicted faces in the previous frames
are used to update the GSLDA model. Note that the patches used for
updating could contain both true positives (faces) and false
positives (misclassified non-faces). After the update process,
the classifier predicts the single patch with the highest classification
score in the next frame as the face patch. This learning
technique is similar to semi-supervised learning, where the
classifier makes use of unlabeled data in conjunction with
a small amount of labeled data. Note that unlike the work
in [14], where both positive and negative patches are used
to incrementally update the model, we only make use of
positive patches.

Table II compares the four face detectors in terms of their
performance. We observe that the performance of the AdaBoost
face detector is the worst. This is not surprising since the
distributions of the training data are highly skewed (1,000 faces
and 10,000 non-faces). Viola and Jones also pointed out this
limitation in [28]. Face detectors trained using AsymBoost and
GSLDA perform quite similarly on the first video. The results
are consistent with the ones reported in [19]. Our results show
that online GSLDA performs best. Based on our observations,
incrementally updating the GSLDA model improves the detection
results significantly at a small increase in computation time.
Fig. 4 compares the empirical results between offline GSLDA
and our proposed online GSLDA.

Fig. 4. A comparison of the offline AdaBoost based frontal face detector [8] (top row), AsymBoost based face detector [28] (second row), GSLDA based face
detector [19] (third row) and our proposed OGSLDA face detector (last row). All detectors are trained initially with 1,000 faces and 10,000 non-faces.
Online GSLDA is incrementally updated with patches classified as faces from the previous video frames. The first video (david indoor) contains 761 frames
of a person moving from a dark to a bright area undergoing large lighting and pose changes (frames 150, 250, 350, 409, 450, 494 and 592). The second
video (trellis) contains 501 frames of a person moving underneath a trellis with large illumination change (frames 50, 85, 182, 231, 287, 386 and 457).

Finally, we compare the Receiver Operating Characteristic
(ROC) curves between the offline GSLDA model (1,000 faces
and 10,000 non-faces) and the online GSLDA model (initially
trained with 1,000 faces and 10,000 non-faces and updated with
661 patches classified as faces). In this experiment, we set
the scaling factor to 1.2 and the window stepping size to 1. The
techniques used for merging overlapping windows are similar
to those in [8]. Detections are considered true or false positives based
on the area of overlap with ground truth bounding boxes. We
shift the classifier threshold and plot the ROC curves (Fig. 5).
Clearly, updating the trained model with relevant training data
increases the overall performance of the classifiers.
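Shifting the classifier threshold to trace an ROC curve can be sketched as below. The function assumes that each candidate detection already carries a classification score and a flag saying whether it matched a ground truth box under the overlap criterion; the function name, its inputs and the choice of reporting false positive counts against detection rate are assumptions made for this illustration.

```python
def roc_points(scores, is_true_positive, n_ground_truth):
    """Return [(num_false_positives, detection_rate)] as the threshold is lowered.

    scores: classification score of each candidate detection.
    is_true_positive: True where the detection matches a ground truth box.
    n_ground_truth: total number of annotated objects.
    """
    ranked = sorted(zip(scores, is_true_positive), key=lambda p: -p[0])
    points, tp, fp = [], 0, 0
    for idx, (score, hit) in enumerate(ranked):
        if hit:
            tp += 1
        else:
            fp += 1
        # Emit one point per distinct threshold, after all detections tied at it.
        if idx + 1 == len(ranked) or ranked[idx + 1][0] != score:
            points.append((fp, tp / n_ground_truth))
    return points
```

Each returned pair corresponds to accepting every detection whose score is at least the current threshold, so plotting the pairs in order yields the curve.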

TABLE II
Performance of four different frontal face detectors on the
david indoor and trellis videos

                          detection rate
                          indoor sequence    trellis sequence
AdaBoost [8]              57.8%              35.3%
AsymBoost [28]            68.7%              37.5%
GSLDA [19]                70.3%              48.5%
Our proposed OGSLDA       83.1%              62.1%

Fig. 5. Comparison of ROC curves between offline and online GSLDA on the
David Ross indoor data set (top) and trellis data set (bottom).

TABLE III
The size of training and test sets used on the single-node
classifier.

          # data splits    faces/split    non-faces/split
Train     3                2000           2000
Test      2                2000           2000

In the next experiment, we compare the performance of
single strong classifiers learned using the offline GSLDA and
online GSLDA algorithms on a frontal face database. The
database consists of 10,000 mirrored faces. The faces were
cropped and rescaled to images of size 24 × 24 pixels. For non-face
examples, we randomly selected 10,000 non-face
patches from non-face images obtained from the internet. The
collected patches are split into three training sets and two test
sets. Each set contains 2,000 face examples and 2,000 non-face
examples (Table III). For each experiment, three different
classifiers are generated, each by selecting two out of the three
training sets for training and the remaining training set for validation.

Fig. 6. A random sample of face images for training.

In this experiment, we train 30, 50 and 100 weak learners
of Haar-like features. The performance is measured by the
test error rate. The results are shown in Fig. 7. The following
observations can be made from these curves. The error of both
classifiers drops as the number of training samples increases.
The error rate of batch GSLDA drops at a slightly faster
rate than that of online GSLDA. This is not surprising. For batch
learning, the previous set of training samples along with the new
sample is used to update the decision stumps every time a
new sample is inserted. For each update, the GSLDA algorithm
throws away the previously selected weak classifiers and reselects
the new 30, 50 and 100 weak classifiers. As a result, the
training process is time consuming and requires a large amount
of storage. In contrast, online GSLDA relies on the initially
trained decision stumps. A new instance does not update
the trained decision stumps but only the between-class and within-class
scatter matrices. The process is suboptimal compared to
batch GSLDA. However, the slight increase in performance of
batch GSLDA over online GSLDA (0.7% drop in test error
rate for 100 weak classifiers) comes at a much higher storage
cost and significantly higher computation time.

2) Performances on Cascades of Strong Classifiers: In this
experiment, we use the mirrored faces from the previous experiment
for batch learning and online learning. The number of initial
positive samples used in each experiment is varied. We use
500 faces, 1,000 faces and 5,000 faces to initially train a
face detector. In each experiment, we train four different
cascaded detectors. The first cascaded detector is the same
as in Viola and Jones [8], i.e., the face data set used in
each cascade stage is the same while the non-face samples
used in each cascade layer are collected from false positives
of the previous stages of the cascade (bootstrapping). The
cascade training algorithm terminates when there are not
enough negative samples to bootstrap.

The second, third and fourth face detectors are trained initially
with a technique similar to that of the first cascaded detector.
However, the second cascaded face detector is incrementally
updated with new negative examples collected from false
positives of the previous stages of the cascade. The third cascaded
face detector is incrementally updated with 5,000 unseen
faces. The final face detector is incrementally updated with
both false positives from previous stages and unseen faces. For
each face detector, weak classifiers are added to the cascade
until the predefined objective is met. In this experiment, we
set the minimum detection rate in each cascade stage to be
99% and the maximum false positive rate to be 50%.
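To see what these per-stage targets imply for a whole cascade, the small sketch below multiplies the stage rates together. It assumes the usual simplification that a window must pass every stage and that the stage rates compound multiplicatively; the choice of 20 stages in the usage note is purely hypothetical and not a figure from the experiments.

```python
def cascade_rates(n_stages, stage_detection_rate=0.99, stage_false_positive_rate=0.5):
    """Overall (detection rate, false positive rate) of an n-stage cascade.

    A window is accepted only if every stage accepts it, so under the
    multiplicative assumption both rates are raised to the power n_stages.
    """
    overall_detection = stage_detection_rate ** n_stages
    overall_false_positive = stage_false_positive_rate ** n_stages
    return overall_detection, overall_false_positive
```

With a hypothetical 20-stage cascade, `cascade_rates(20)` gives an overall detection rate of roughly 0.82 and a false positive rate below one in a million, which is why each stage can tolerate a false positive rate as high as 50%.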

We tested our face detectors on a low-resolution face
dataset, the MIT+CMU frontal face test set. The complete set
contains 130 images with 507 frontal faces. In this experiment,
we set the scaling factor to 1.2 and the window shifting step to
1. The techniques used for merging overlapping windows are
similar to [8]. Detections are considered true or false positives
based on the area of overlap with ground truth bounding
boxes. To be considered a correct detection, the area of
overlap between the predicted bounding box and the ground truth
bounding box must exceed 50%. Multiple detections of the
same face in an image are considered false detections.

Fig. 7. Comparison of classification error rates between batch GSLDA and online GSLDA. The number of weak learners (decision stumps on Haar-like
features) in each experiment is (a) 30, (b) 50, (c) 100. The error of both classifiers drops as the number of training samples increases.

Fig. 8. Comparison of ROC curves on the MIT+CMU face test set. The four detectors are trained using (a) 500 faces, (b) 1,000 faces and (c) 5,000 and
10,000 mirrored faces.

Fig. 9. Comparison of the training time between GSLDA and online
GSLDA algorithms. The first and second GSLDA detectors are trained with
5,000 faces and 5,000 non-faces, and 10,000 faces and 10,000 non-faces,
respectively. Online GSLDA is initially trained with 5,000 faces and 5,000
non-faces and updated with one million new patches. Notice that there is only a
slight increase in training time even though we incrementally update with
200× more training samples.

Fig. 8 shows a comparison between the ROC curves
produced by the GSLDA and online GSLDA classifiers. The ROC curves in
Fig. 8(a) show that the online GSLDA classifier outperforms the
GSLDA classifier at all false positive rates when initially
trained with 500 faces. Incrementally updating the GSLDA
model with unseen faces (+5000 faces) yields a better result
than updating the model with new false positives from previous
stages of the cascade (+$10^6$ negative patches). The online
classifier performs best when updated with both new positive
and negative patches. Fig. 8(b) shows a comparison when
the number of initial training samples has been increased
to 1,000 faces. The performance gap between GSLDA and
online GSLDA is now smaller. We observe the performance
of GSLDA and online GSLDA (+$10^6$ negative patches)
to be very similar. This indicates that the cascade learning
framework proposed by Viola and Jones might have already
incorporated the benefit of massive negative patches. Incremental
learning with new negative instances does not seem to
improve the performance of cascaded detectors any further.
Another way to explain our findings is to use
the concept of the linear asymmetric classifier (LAC) proposed in
[11]. In [11], the asymmetric node learning goal is expressed
as

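$$\max_{w, w_0} \; \Pr_{x_1 \sim (m_1, \Sigma_1)} \{ w^\top x_1 \ge w_0 \} \quad \text{subject to} \quad \Pr_{x_2 \sim (m_2, \Sigma_2)} \{ w^\top x_2 \le w_0 \} = \beta. \tag{20}$$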

Since the problem has no closed-form solution, the authors
developed an approximate solution when $\beta = 0.5$. To find
a closed-form solution, the authors assumed that $w^\top x$ is
Gaussian for any $w$, that the class $C_2$ distribution is symmetric and
that the median value of the class $C_2$ distribution is close to its
mean. The direction $w$ can then be approximated by



$$\max_{w \neq 0} \; \frac{w^\top (m_1 - m_2)}{\sqrt{w^\top \Sigma_1 w}}. \tag{21}$$



From their objective functions, the only difference between
FDA (5) and LAC (21) is that the pooled covariance matrix
of FDA, $\Sigma_1 + \Sigma_2$, is replaced by the covariance matrix of
class $C_1$, $\Sigma_1$. In other words, when we train the classifier with
the asymmetric node learning goal for the cascade learning
framework, the variance of the negative class becomes less
relevant. In contrast, new instances of the positive class affect
both the numerator and the denominator in (21). Hence, it is easier
to notice the performance improvement when new positive
instances are inserted. Our results seem to be consistent with
their derivations.
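The difference between the two criteria can be made tangible with a two-dimensional toy example. The sketch below uses the standard closed-form maximizers of such ratio objectives, $w \propto (\Sigma_1 + \Sigma_2)^{-1}(m_1 - m_2)$ for FDA and $w \propto \Sigma_1^{-1}(m_1 - m_2)$ for LAC; it assumes that the FDA criterion takes the usual pooled-covariance form, and all function names and numbers are illustrative.

```python
import math


def inv2(m):
    """Inverse of a 2 x 2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matvec(m, v):
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def fda_direction(m1, m2, s1, s2):
    """Direction maximizing class separation with the pooled covariance."""
    pooled = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(s1, s2)]
    diff = [a - b for a, b in zip(m1, m2)]
    return normalize(matvec(inv2(pooled), diff))


def lac_direction(m1, m2, s1):
    """Direction of the LAC approximation: only the class-1 covariance enters."""
    diff = [a - b for a, b in zip(m1, m2)]
    return normalize(matvec(inv2(s1), diff))
```

Changing the negative-class covariance `s2` rotates the FDA direction but leaves the LAC direction untouched, whereas changing `s1` or `m1` moves both, which mirrors the argument made above.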

We further increase the number of initial training faces to
5,000. All face detectors now seem to perform very similarly
to each other. We conjecture that this is the best performance
that our cascaded detectors with the provided training set
can achieve on the MIT+CMU data set. The results of the
face detectors trained with 10,000 faces and 10,000 non-faces
seem to support our assumptions (Fig. 8(c)). To further
improve the performance, different cascade algorithms, e.g.,
soft-cascade [31], WaldBoost [32], multi-exit classifiers [33],
etc., and a combination with other types of features, e.g., edge
orientation histograms (EOH) [34], covariance features [35],
etc., can also be explored. Fig. 9 shows a comparison
of the computation cost between batch GSLDA and online
GSLDA. The horizontal axis shows the number of weak
learners (decision stumps) and the vertical axis indicates the
training time in minutes. From the figure, online learning
is much faster than training a batch GSLDA classifier as
the number of weak learners grows. On average, our online
classifier takes less than 1.5 milliseconds to update a strong
classifier of 200 weak learners on a standard off-the-shelf PC
with the use of the GNU Scientific Library (GSL).

IV. Conclusion 

In this work, we have proposed an efficient online object
detection algorithm. Unlike many existing algorithms which
apply a boosting approach, our framework makes use of
greedy sparse linear discriminant analysis (GSLDA) based
feature selection, which aims to maximize the class-separation
criterion. Our experimental results show that our incremental
algorithm not only performs comparably to the batch GSLDA
algorithm but is also much more efficient. On the USPS digits data
sets, our online algorithm with decision stump weak learners
outperforms online boosting with class-conditional Gaussian
distributions. Our extensive experiments on face detection
reveal that it is always beneficial to incrementally train the
detector with online samples. Ongoing work includes the
search for more accurate and efficient online weak learners.